TurkishBERTweet: Fast and Reliable Large Language Model for Social Media Analysis
Turkish is one of the most widely spoken languages in the world. The wide use of this language on social media platforms such as Twitter, Instagram, and TikTok, together with the country's strategic position in world politics, makes it appealing to social network researchers and industry. To address this need, we introduce TurkishBERTweet, the first large-scale pre-trained language model for Turkish social media, built using almost 900 million tweets. The model shares the same architecture as the base BERT model but with a smaller input length, making TurkishBERTweet lighter than BERTurk and giving it significantly lower inference time. We trained our model using the same approach as the RoBERTa model and evaluated it on two text classification tasks: Sentiment Classification and Hate Speech Detection. We demonstrate that TurkishBERTweet outperforms the other available alternatives in generalizability, and that its lower inference time gives it a significant advantage in processing large-scale datasets. We also compared our models with commercial OpenAI solutions in terms of cost and performance to demonstrate that TurkishBERTweet is a scalable and cost-effective solution. As part of our research, we release TurkishBERTweet and fine-tuned LoRA adapters for the mentioned tasks under the MIT License to facilitate future research and applications on Turkish social media. Our TurkishBERTweet model is available at: https://github.com/ViralLab/TurkishBERTweet
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Middle East > Republic of Türkiye (0.04)
- Europe > Germany > Berlin (0.04)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
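The abstract notes the model was trained "using the same approach as the RoBERTa model"; RoBERTa's key departure from BERT is dynamic masking, where a fresh mask pattern is sampled every time a sequence is seen rather than fixed once at preprocessing. A minimal pure-Python sketch of that sampling step (the 15% rate and the 80/10/10 split follow the published RoBERTa recipe; the token ids and vocabulary size are illustrative):

```python
import random

MASK_ID = 4        # placeholder id for the <mask> token (illustrative)
VOCAB_SIZE = 100   # illustrative vocabulary size

def dynamic_mask(token_ids, mask_prob=0.15, rng=None):
    """RoBERTa-style dynamic masking: sample a fresh mask pattern for a
    sequence each time it is seen. Returns (masked_ids, labels), where
    labels hold the original token at masked positions and -100 (ignored
    by the loss) elsewhere."""
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                masked.append(MASK_ID)                    # 80%: <mask>
            elif r < 0.9:
                masked.append(rng.randrange(VOCAB_SIZE))  # 10%: random token
            else:
                masked.append(tok)                        # 10%: unchanged
        else:
            labels.append(-100)
            masked.append(tok)
    return masked, labels

# The same sequence gets a different masked view on each pass,
# so repeated epochs over the corpus see varied training signals.
seq = list(range(10, 30))
view1, _ = dynamic_mask(seq, rng=random.Random(0))
view2, _ = dynamic_mask(seq, rng=random.Random(1))
```

Because the pattern is resampled on every pass, each tweet yields many distinct masked views over training, which is part of what makes a corpus of this scale go further than static masking would.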
Comparison of Pre-trained Language Models for Turkish Address Parsing
Ünal, Muhammed Cihat, Aygün, Betül, Gerek, Aydın
Transformer-based pre-trained models such as BERT and its variants, which are trained on large corpora, have demonstrated tremendous success on natural language processing (NLP) tasks. Most academic work is based on the English language; however, the number of multilingual and language-specific studies increases steadily. Furthermore, several studies have claimed that language-specific models outperform multilingual models on various tasks. Therefore, the community tends to train or fine-tune models specifically for the language of their case study. In this paper, we focus on Turkish maps data and thoroughly evaluate both multilingual and Turkish-based BERT, DistilBERT, ELECTRA, and RoBERTa. Besides, we also propose a multilayer perceptron (MLP) for fine-tuning BERT in addition to the standard approach of one-layer fine-tuning. For the dataset, a mid-sized, relatively high-quality address parsing corpus is constructed. Experiments conducted on this dataset indicate that Turkish language-specific models with MLP fine-tuning yield slightly better results than the multilingual fine-tuned models. Moreover, visualization of address tokens' representations further indicates the effectiveness of BERT variants for classifying a variety of addresses.
- Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.05)
- Europe > Netherlands > South Holland > Dordrecht (0.04)
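The paper's comparison of one-layer fine-tuning versus an MLP head on top of BERT can be made concrete with a small dependency-free sketch of the two forward passes (layer sizes and the tanh nonlinearity are illustrative assumptions, not the paper's exact configuration):

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def linear(x, weights, bias):
    """y = Wx + b for plain Python lists (weights holds the rows of W)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def one_layer_head(pooled, W, b):
    """Standard fine-tuning head: a single linear layer over the pooled
    [CLS] embedding, followed by softmax over the address classes."""
    return softmax(linear(pooled, W, b))

def mlp_head(pooled, W1, b1, W2, b2):
    """MLP variant: a hidden layer with tanh, then the output projection.
    The extra nonlinearity is what distinguishes it from the one-layer head."""
    hidden = [math.tanh(h) for h in linear(pooled, W1, b1)]
    return softmax(linear(hidden, W2, b2))
```

In practice both heads would be trained jointly with BERT during fine-tuning; the sketch only shows the forward pass that separates the two approaches.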
Challenges Encountered in Turkish Natural Language Processing Studies
Natural language processing aims to analyze a language element such as writing or speech with software and convert it into information. Considering that each language has its own grammatical rules and vocabulary diversity, the complexity of studies in this field is understandable. Turkish, for instance, is a very interesting language in many respects: an agglutinative word structure, consonant/vowel harmony, a large number of productive derivational morphemes (a practically infinite vocabulary), derivational and syntactic relations, and complex stress and phonological rules. This study discusses the features of Turkish that are interesting from a natural language processing perspective. In addition, summary information is given about natural language processing techniques, systems, and various resources developed for Turkish. Keywords: Natural language processing, Turkish natural language processing, NLP. Article history: Received 06 June 2020, Accepted 26 November 2020, Available online 27 November 2020. Introduction: Language is undoubtedly the main factor in communication between people. Natural language processing studies aim at the most effective use of the language factor in human-computer communication. Natural language processing is a subfield of artificial intelligence and linguistics.
- Asia > Middle East > Republic of Türkiye > Ankara Province > Ankara (0.04)
- Europe > Netherlands > Utrecht (0.04)
- Europe > France > Pays de la Loire > Loire-Atlantique > Nantes (0.04)
- Asia > Middle East > Republic of Türkiye > Hatay Province > Iskenderun (0.04)
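One of the Turkish features the abstract lists, vowel harmony, is regular enough to sketch directly: a suffix surfaces with a back or front vowel depending on the last vowel of the stem. A minimal illustration with the two-fold plural suffix -lar/-ler (loanword exceptions such as "saat" are deliberately ignored):

```python
BACK = set("aıou")   # back vowels select -lar
FRONT = set("eiöü")  # front vowels select -ler

def plural(noun):
    """Append the Turkish plural suffix chosen by two-fold vowel harmony.
    Harmony is governed by the last vowel of the stem: back vowels
    (a, ı, o, u) take -lar, front vowels (e, i, ö, ü) take -ler."""
    for ch in reversed(noun.lower()):
        if ch in BACK:
            return noun + "lar"
        if ch in FRONT:
            return noun + "ler"
    raise ValueError("no vowel found in stem")

# kitap -> kitaplar ('books'), ev -> evler ('houses')
```

The four-fold suffixes (-ı/-i/-u/-ü) follow the same pattern with rounding added, and suffixes chain agglutinatively, which is part of what makes the vocabulary practically infinite.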
A Text Classification Application: Poet Detection from Poetry
Sahin, Durmus Ozkan, Kural, Oguz Emre, Kilic, Erdal, Karabina, Armagan
With the widespread use of the internet, the size of text data increases day by day. Poems are one example of this growing body of text. In this study, we aim to classify poems according to their poet. First, a dataset consisting of poems by three different poets, written in English, is constructed. Then, text categorization techniques are applied to it. The chi-square technique is used for feature selection. In addition, five different classification algorithms are tried: sequential minimal optimization, naive Bayes, the C4.5 decision tree, random forest, and k-nearest neighbors. Although the classifiers showed very different results, a classification success rate of over 70% was achieved by the sequential minimal optimization technique.
- Asia > Middle East > Republic of Türkiye > Zonguldak Province > Zonguldak (0.05)
- Asia > Middle East > Republic of Türkiye > Trabzon Province > Trabzon (0.05)
- Asia > Middle East > Republic of Türkiye > Mugla Province > Mugla (0.05)
- Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.90)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)
- Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.57)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.54)
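The chi-square feature selection step the abstract mentions scores each term by how unevenly it is distributed across classes, then keeps the highest-scoring terms before training the classifiers. A small self-contained sketch (the toy corpus and the top-k interface are illustrative assumptions; the paper's exact setup is not specified here):

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square score of a term/class pair from a 2x2 contingency table:
    n11 = docs of the class containing the term, n10 = other docs
    containing it, n01 = docs of the class lacking it, n00 = the rest."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    return num / den if den else 0.0

def select_features(docs, k):
    """Rank vocabulary terms by their maximum chi-square score over all
    classes and keep the top k (ties broken alphabetically)."""
    classes = {label for _, label in docs}
    vocab = {t for text, _ in docs for t in text.split()}
    scores = {}
    for term in vocab:
        best = 0.0
        for c in classes:
            n11 = sum(1 for text, l in docs if l == c and term in text.split())
            n10 = sum(1 for text, l in docs if l != c and term in text.split())
            n01 = sum(1 for text, l in docs if l == c and term not in text.split())
            n00 = len(docs) - n11 - n10 - n01
            best = max(best, chi_square(n11, n10, n01, n00))
        scores[term] = best
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Toy corpus: two 'poets' with distinct vocabulary.
docs = [("sea wave ship", "P1"), ("sea storm wave", "P1"),
        ("rose love heart", "P2"), ("love rose moon", "P2")]
```

Terms that appear in every document of one class and nowhere else (here "sea", "wave", "rose", "love") receive the maximum score, so they survive selection while weakly discriminative terms are dropped.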